This report is submitted by Greeshma Jeev Koothuparambil and Olayemi Morrison as part of Laboratory 2 of the Visualization (732A98) course for the Autumn 2023 semester.
Assignment 1
The following libraries were used for this assignment:
ggplot2
gridExtra
dplyr
ggpubr
plotly
grid
Here is how we loaded our libraries:
library(ggplot2)
library(gridExtra)
library(dplyr)
library(ggpubr)
library(plotly)
library(grid)
#reading the file.
df <- read.csv("olive.csv")
The loaded dataframe looks like this:
| X | Region | Area | palmitic | palmitoleic | stearic |
|---|---|---|---|---|---|
| 1 | 1 | North-Apulia | 1075 | 75 | 226 |
| 2 | 1 | North-Apulia | 1088 | 73 | 224 |
| 3 | 1 | North-Apulia | 911 | 54 | 246 |
| 4 | 1 | North-Apulia | 966 | 57 | 240 |
| 5 | 1 | North-Apulia | 1051 | 67 | 259 |
| 6 | 1 | North-Apulia | 911 | 49 | 268 |
| oleic | linoleic | linolenic | arachidic | eicosenoic |
|---|---|---|---|---|
| 7823 | 672 | 36 | 60 | 29 |
| 7709 | 781 | 31 | 61 | 29 |
| 8113 | 549 | 31 | 63 | 29 |
| 7952 | 619 | 50 | 78 | 35 |
| 7771 | 672 | 50 | 80 | 46 |
| 7924 | 678 | 51 | 70 | 44 |
It has 572 observations and 11 variables, namely:
X, Region, Area, palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, and eicosenoic.
The summary of the table is as follows:
## X Region Area palmitic
## Min. : 1.0 Min. :1.000 South-Apulia :206 Min. : 610
## 1st Qu.:143.8 1st Qu.:1.000 Inland-Sardinia: 65 1st Qu.:1095
## Median :286.5 Median :1.000 Calabria : 56 Median :1201
## Mean :286.5 Mean :1.699 Umbria : 51 Mean :1232
## 3rd Qu.:429.2 3rd Qu.:3.000 East-Liguria : 50 3rd Qu.:1360
## Max. :572.0 Max. :3.000 West-Liguria : 50 Max. :1753
## (Other) : 94
## palmitoleic stearic oleic linoleic
## Min. : 15.00 Min. :152.0 Min. :6300 Min. : 448.0
## 1st Qu.: 87.75 1st Qu.:205.0 1st Qu.:7000 1st Qu.: 770.8
## Median :110.00 Median :223.0 Median :7302 Median :1030.0
## Mean :126.09 Mean :228.9 Mean :7312 Mean : 980.5
## 3rd Qu.:169.25 3rd Qu.:249.0 3rd Qu.:7680 3rd Qu.:1180.8
## Max. :280.00 Max. :375.0 Max. :8410 Max. :1470.0
##
## linolenic arachidic eicosenoic
## Min. : 0.00 Min. : 0.0 Min. : 1.00
## 1st Qu.:26.00 1st Qu.: 50.0 1st Qu.: 2.00
## Median :33.00 Median : 61.0 Median :17.00
## Mean :31.89 Mean : 58.1 Mean :16.28
## 3rd Qu.:40.25 3rd Qu.: 70.0 3rd Qu.:28.00
## Max. :74.00 Max. :105.0 Max. :58.00
##
1. Create a scatterplot in Ggplot2 that shows dependence of Palmitic on Oleic in which observations are colored by Linoleic. Create also a similar scatter plot in which you divide Linoleic variable into four classes (use cut_interval() ) and map the discretized variable to color instead. How easy/difficult is it to analyze each of these plots? What kind of perception problem is demonstrated by this experiment?
#plotting the first graph
p<-ggplot(df,aes(x=palmitic, y=oleic, color=linoleic)) +
geom_point()+
ggtitle("Dependency of Palmitic over Oleic based on Linoleic")
intervaldf <- data.frame(cut_interval(df$linoleic,n = 4))
colnames(intervaldf) <- "linoleicinterval4"
p1<-ggplot(df,aes(x=palmitic, y=oleic, color=intervaldf$linoleicinterval4)) +
geom_point()+
ggtitle("Dependency of Palmitic over Oleic based on Linoleic Level")+
labs(color = "Linoleic Interval")
The plot looks like this:
In the first plot “p”, a viewer may initially assume that a darker shade means a higher value, but the legend shows that the value increases as the color gets lighter. Because Linoleic is mapped to a continuous color scale, many shades are too similar to tell apart, and overlapping points hide some observations. The perception problem is that we cannot reliably recognize, organize, or interpret this many shades.
The second plot “p1” is much easier to analyze because the discretized Linoleic variable is mapped to four distinct colors instead. A clear boundary can be seen for each category.
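As a rough illustration of what the discretization step does, the following base-R sketch builds four equal-width classes with cut(), which is similar in spirit to cut_interval(). The input is only a handful of Linoleic values taken from the table and summary above, and the variable names are illustrative assumptions rather than part of the original analysis.

```r
# Illustrative Linoleic values: a few rows shown above plus the minimum,
# median and maximum from the summary (not the full data set).
linoleic <- c(448, 549, 619, 672, 678, 781, 1030, 1470)

# Four equal-width classes over the observed range, similar in spirit to
# ggplot2::cut_interval(linoleic, n = 4).
breaks <- seq(min(linoleic), max(linoleic), length.out = 5)
linoleic_class <- cut(linoleic, breaks = breaks, include.lowest = TRUE)

# Number of observations that fall into each class
table(linoleic_class)
```

Each observation now carries one of four labels, so the color scale only has to encode four values instead of a continuous range.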
2. Create scatterplots of Palmitic vs Oleic in which you map the discretized Linoleic with four classes to:
a. Color
b. Size
c. Orientation angle (use geom_spoke() )
State in which plots it is more difficult to differentiate between the categories and connect your findings to perception metrics (i.e. how many bits can be decoded by a specific aesthetics)
#plotting the second graph
t1 <- textGrob("Comparison of different graphs on \n Dependency of Palmitic over Oleic based on Linoleic Level")
p1<-ggplot(df,aes(x=palmitic, y=oleic, color=intervaldf$linoleicinterval4)) +
geom_point()+ labs(color = "Linoleic Interval")
p2<-ggplot(df,aes(x=palmitic, y=oleic, size=intervaldf$linoleicinterval4)) +
geom_point()+ labs(size = "Linoleic Interval")
p3<-ggplot(df,aes(x=palmitic, y=oleic)) +
geom_point()+ geom_spoke(aes(angle = as.integer(intervaldf$linoleicinterval4)), radius = 50)
graphs <- arrangeGrob(grobs = list(t1,p1,p2,p3), ncol = 1,nrow = 4,heights = c(4,10,10,10))
palmiticXoleicXlinoleic <- as_ggplot(graphs)
The resulting graph looks interesting:
In plots “p2” and “p3”, where the discretized variable is mapped to size and to spoke orientation, it is really hard to differentiate between the categories. The circles and spokes overlap, and this overplotting hides data and makes it difficult to distinguish the boundaries between observations. It is practically impossible to recognize, organize, or interpret the data in this way.
According to the perception metrics, up to about 10 hues can be recognized. Here we use only 4, which corresponds to log2(4) = 2 bits and is easy to perceive. For the size-based scatter plot, up to 5 sizes can nominally be recognized, but the overplotting makes even 2 bits of channel capacity hard to decode here. The same holds for the spokes: although human perception has about a 3-bit capacity for line orientation, the overlapping values make it hard to read the orientation of every spoke.
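The arithmetic behind these statements can be sketched as follows. The number of distinguishable levels per aesthetic is taken from the figures quoted above (10 hues, 5 sizes, 3 bits for orientation); the variable names are assumptions made for this illustration.

```r
# Number of distinguishable levels per aesthetic, as quoted in the text above.
# 8 orientation levels correspond to the stated 3 bits (2^3 = 8).
levels_distinguishable <- c(hue = 10, size = 5, orientation = 8)

# Channel capacity in bits is log2 of the number of levels that can be told apart.
capacity_bits <- log2(levels_distinguishable)

# Bits needed to encode the four Linoleic classes.
bits_needed <- log2(4)

round(capacity_bits, 2)
bits_needed <= capacity_bits
```

On paper, all three aesthetics have enough capacity for four classes, which is why the difficulty seen in the size and spoke plots is attributed to overplotting rather than to the nominal capacity.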
3. Create a scatterplot of Oleic vs Eicosenoic in which color is defined by numeric values of Region. What is wrong with such a plot? Now create a similar kind of plot in which Region is a categorical variable. How quickly can you identify decision boundaries? Does preattentive or attentive mechanism make it possible?
#plotting the third graph
t2 <- textGrob("Comparison of different graphs on \n Dependency of Oleic over Eicosenoic based on Region")
p4<-ggplot(df,aes(x=oleic, y= eicosenoic, color=Region)) +
geom_point()
p5<-ggplot(df,aes(x=oleic, y= eicosenoic, color=as.factor(Region))) +
geom_point()
graphs <- arrangeGrob(grobs = list(t2,p4,p5), ncol = 1,nrow = 3,heights = c(4,10,10))
oleicXeicosenoicXRegion <- as_ggplot(graphs)
The generated scatter plots look like this:
Defining color by the numeric values of Region makes the data more difficult to interpret: ggplot2 treats Region as a continuous variable and draws a color gradient, although Region only takes a few discrete values. Even though there are clear boundaries, we cannot easily tell which shade is mapped to which region. With Region as a categorical variable, each region gets a distinct color, and the boundaries can be identified quickly through the preattentive mechanism.
4. Create a scatterplot of Oleic vs Eicosenoic in which color is defined by a discretized Linoleic (3 classes), shape is defined by a discretized Palmitic (3 classes) and size is defined by a discretized Palmitoleic (3 classes). How difficult is it to differentiate between 27=3* 3* 3 different types of observations? What kind of perception problem is demonstrated by this graph?
#plotting the fourth graph
intervaldf$linoleicinterval3 <- cut_interval(df$linoleic,n = 3)
intervaldf$palmiticinterval3 <- cut_interval(df$palmitic,n = 3)
intervaldf$palmitoleicinterval3 <- cut_interval(df$palmitoleic,n = 3)
p6<-ggplot(df,aes(x=oleic, y= eicosenoic, color=intervaldf$linoleicinterval3,
shape = intervaldf$palmiticinterval3, size = intervaldf$palmitoleicinterval3)) +
geom_point()+
ggtitle("Dependency of Oleic over Eicosenoic based on \nLinoleic Level, Palmitic and Palmitoleic")+
labs(color = "Linoleic Interval", shape = "Palmitic Interval", size = "Palmitoleic Interval")
The dependency of Oleic over Eicosenoic based on Linoleic level, Palmitic and Palmitoleic is shown in the following plot:
It is very difficult to differentiate between the 3 * 3 * 3 classes of observations, because a large number of categories or dimensions is mentally taxing for viewers to process and remember. Using too many colors, shapes, and sizes at once also risks misinterpretation and confusion, and some visual cues are less effective than others. A relative judgement error is also present in the graph: many plotted values have similar positions along the common scale, which makes them difficult to recognize. Judging area (point size) adds to the problem. Combining this many features in a single point is tiresome for the brain, because attentive perception relies only on short-term memory. In addition, although the Linoleic interval uses different hues for effective recognition, hue says nothing about the ordering of the values. Seeing different colors may lead the brain to treat the classes as independent of each other and to overlook that each color represents a range of Linoleic content in the oil. This is where the perspective problem arises.
5. Create a scatterplot of Oleic vs Eicosenoic in which color is defined by Region, shape is defined by a discretized Palmitic (3 classes) and size is defined by a discretized Palmitoleic (3 classes). Why is it possible to clearly see a decision boundary between Regions despite many aesthetics are used? Explain this phenomenon from the perspective of Treisman’s theory.
#plotting the fifth graph
p7<-ggplot(df,aes(x=oleic, y= eicosenoic, color=Region,
shape = intervaldf$palmiticinterval3, size = intervaldf$palmitoleicinterval3)) +
geom_point()+
ggtitle("Dependency of Oleic over Eicosenoic based on \nRegion, Palmitic and Palmitoleic")+
labs(color = "Region", shape = "Palmitic Interval", size = "Palmitoleic Interval")
The dependency of Oleic over Eicosenoic based on Region, Palmitic and Palmitoleic is shown in the following plot:
A decision boundary is clearly visible because, according to Treisman’s theory, a single basic feature such as color is processed preattentively, so the color groups can be told apart quickly, as is obvious in this plot. Once the boundaries have been identified, we can use focused attention to identify the various shapes and sizes and gather more detail.
6. Use Plotly to create a pie chart that shows the proportions of oils coming from different Areas. Hide labels in this plot and keep only hover-on labels. Which problem is demonstrated by this graph?
#plotting the sixth graph
p8 <- plot_ly(data=df,labels=~factor(Area), type = "pie", hoverinfo ='label', textinfo = "none")%>%
layout(title= "Areal Distribution of Oil Production", showlegend = F)
The resulting graph is as follows:
Human perception cannot be relied on when observing this graph, because we cannot accurately determine the angle or size of each section just by looking at it. Relative judgement based on angle is compromised here, which gives rise to scaling and perspective problems.
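To make the judgement problem concrete, this sketch converts the Area counts printed in the summary above into shares and pie-slice angles. Only the counts come from the report; the variable names are assumptions, and "(Other)" pools the areas that the summary does not list individually.

```r
# Area counts as printed in the summary above.
area_counts <- c("South-Apulia" = 206, "Inland-Sardinia" = 65, "Calabria" = 56,
                 "Umbria" = 51, "East-Liguria" = 50, "West-Liguria" = 50,
                 "(Other)" = 94)

proportions <- area_counts / sum(area_counts)
angles_deg <- 360 * proportions   # pie-slice angle for each area

round(100 * proportions, 1)       # shares in percent

# Difference in slice angle between two neighbouring areas: well under one degree
angles_deg[["Umbria"]] - angles_deg[["East-Liguria"]]
```

Several slices differ by less than one degree, which is far too little to judge from angle alone without labels.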
7. Create a 2d-density contour plot with Ggplot2 in which you show dependence of Linoleic vs Eicosenoic. Compare the graph to the scatterplot using the same variables and comment why this contour plot can be misleading.
#plotting the seventh graph
t3 <- textGrob("Comparison of Density contour and Relational Scatter plot between Linoleic and Eicosenoic")
p9<-ggplot(df,aes(x=linoleic, y= eicosenoic)) +
geom_density_2d()
p10<-ggplot(df,aes(x=linoleic, y= eicosenoic)) +
geom_point()
graphs1 <- arrangeGrob(grobs = list(t3,p9,p10), ncol = 1,nrow = 3, heights = c(1,10,10))
linoleicXeicosenoic <- as_ggplot(graphs1)
The resulting graph is as follows:
Comparing the two graphs, the density contour plot obscures certain patterns. In some areas where the scatterplot shows many points, the contour plot displays a lower density, or no data at all. The appearance of the contour plot is also highly sensitive to the arrangement and distribution of the data points, so the apparent clusters and patterns may differ from those in the data, potentially leading to misinterpretation.
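One way to see how much a density estimate depends on smoothing is to run MASS::kde2d(), the estimator that geom_density_2d() relies on, with two different bandwidths on the same points. The sketch below uses synthetic data with two separated clusters, not the olive data, and the bandwidths and grid limits are arbitrary assumptions chosen for illustration.

```r
library(MASS)  # provides kde2d()

set.seed(1)
# Synthetic data (an assumption for illustration): two well-separated clusters.
x <- c(rnorm(200, mean = 0), rnorm(200, mean = 6))
y <- c(rnorm(200, mean = 0), rnorm(200, mean = 6))
lims <- c(-4, 10, -4, 10)

# Same data, two bandwidths: a narrow and a very wide smoothing kernel.
dens_narrow <- kde2d(x, y, h = c(1, 1), n = 50, lims = lims)
dens_wide   <- kde2d(x, y, h = c(8, 8), n = 50, lims = lims)

# The wide kernel flattens the peaks ...
max(dens_narrow$z)
max(dens_wide$z)

# ... and assigns far more density to the nearly empty gap around (3, 3).
gap <- which.min(abs(dens_narrow$x - 3))  # grid index closest to 3
dens_narrow$z[gap, gap]
dens_wide$z[gap, gap]
```

The contours drawn from the two estimates would look quite different even though the underlying points are identical, which is one reason to read a contour plot alongside the scatterplot.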
Assignment 2
The following libraries were used for this assignment:
plotly
MASS
xlsx
tidyr
Here is how we loaded our libraries:
library(plotly)
library(MASS)
library(xlsx)
library(tidyr)
1. Load the file to R and answer whether it is reasonable to scale these data in order to perform a multidimensional scaling (MDS).
#Reading the baseball data
baseball = read.xlsx(file = "baseball-2016.xlsx",sheetIndex = 1)
row.names(baseball) <- baseball$Team
The loaded dataframe looks like this:
| | Team | League | Won | Lost | Runs.per.game | HR.per.game |
|---|---|---|---|---|---|---|
| Aizona Diamondbacks | Aizona Diamondbacks | NL | 69 | 93 | 4.64 | 1.172840 |
| Atlanta Braves | Atlanta Braves | NL | 68 | 93 | 4.03 | 0.757764 |
| Baltimore Orioles | Baltimore Orioles | AL | 89 | 73 | 4.59 | 1.561728 |
| Boston Red Sox | Boston Red Sox | AL | 93 | 69 | 5.42 | 1.283951 |
| Chicago Cubs | Chicago Cubs | NL | 103 | 58 | 4.99 | 1.236025 |
| Chicago White Sox | Chicago White Sox | AL | 78 | 84 | 4.23 | 1.037037 |
| | AB | Runs | Hits | X2B | X3B | HR |
|---|---|---|---|---|---|---|
| Aizona Diamondbacks | 5665 | 752 | 1479 | 285 | 56 | 190 |
| Atlanta Braves | 5514 | 649 | 1404 | 295 | 27 | 122 |
| Baltimore Orioles | 5524 | 744 | 1413 | 265 | 6 | 253 |
| Boston Red Sox | 5670 | 878 | 1598 | 343 | 25 | 208 |
| Chicago Cubs | 5503 | 808 | 1409 | 293 | 30 | 199 |
| Chicago White Sox | 5550 | 686 | 1428 | 277 | 33 | 168 |
| | RBI | StolenB | CaughtS | BB | SO | BAvg |
|---|---|---|---|---|---|---|
| Aizona Diamondbacks | 709 | 137 | 31 | 463 | 1427 | 0.261 |
| Atlanta Braves | 615 | 75 | 34 | 502 | 1240 | 0.255 |
| Baltimore Orioles | 710 | 19 | 13 | 468 | 1324 | 0.256 |
| Boston Red Sox | 836 | 83 | 24 | 558 | 1160 | 0.282 |
| Chicago Cubs | 767 | 66 | 34 | 656 | 1339 | 0.256 |
| Chicago White Sox | 656 | 77 | 36 | 455 | 1285 | 0.257 |
| | OBP | SLG | OPS | TB | GDP | HBP |
|---|---|---|---|---|---|---|
| Aizona Diamondbacks | 0.320 | 0.432 | 0.752 | 2446 | 117 | 50 |
| Atlanta Braves | 0.321 | 0.384 | 0.705 | 2119 | 145 | 59 |
| Baltimore Orioles | 0.317 | 0.443 | 0.760 | 2449 | 119 | 44 |
| Boston Red Sox | 0.348 | 0.461 | 0.810 | 2615 | 137 | 43 |
| Chicago Cubs | 0.343 | 0.429 | 0.772 | 2359 | 107 | 96 |
| Chicago White Sox | 0.317 | 0.410 | 0.727 | 2275 | 122 | 53 |
| | SH | SF | IBB | LOB |
|---|---|---|---|---|
| Aizona Diamondbacks | 43 | 38 | 43 | 1113 |
| Atlanta Braves | 64 | 52 | 60 | 1161 |
| Baltimore Orioles | 17 | 36 | 19 | 1065 |
| Boston Red Sox | 8 | 40 | 34 | 1162 |
| Chicago Cubs | 42 | 37 | 45 | 1217 |
| Chicago White Sox | 29 | 44 | 16 | 1105 |
It has 30 observations and 28 variables, namely:
Team, League, Won, Lost, Runs.per.game, HR.per.game, AB, Runs, Hits, X2B, X3B, HR, RBI, StolenB, CaughtS, BB, SO, BAvg, OBP, SLG, OPS, TB, GDP, HBP, SH, SF, IBB and LOB.
The Summary of the table is as follows:
## Team League Won Lost
## Aizona Diamondbacks: 1 AL:15 Min. : 59.0 Min. : 58.0
## Atlanta Braves : 1 NL:15 1st Qu.: 71.5 1st Qu.: 73.5
## Baltimore Orioles : 1 Median : 82.5 Median : 79.5
## Boston Red Sox : 1 Mean : 80.9 Mean : 80.9
## Chicago Cubs : 1 3rd Qu.: 88.5 3rd Qu.: 90.5
## Chicago White Sox : 1 Max. :103.0 Max. :103.0
## (Other) :24
## Runs.per.game HR.per.game AB Runs Hits
## Min. :3.770 Min. :0.7578 Min. :5330 Min. :610.0 Min. :1275
## 1st Qu.:4.178 1st Qu.:1.0185 1st Qu.:5482 1st Qu.:676.2 1st Qu.:1369
## Median :4.465 Median :1.1852 Median :5521 Median :723.0 Median :1410
## Mean :4.478 Mean :1.1556 Mean :5519 Mean :724.8 Mean :1409
## 3rd Qu.:4.705 3rd Qu.:1.3039 3rd Qu.:5550 3rd Qu.:762.0 3rd Qu.:1444
## Max. :5.420 Max. :1.5617 Max. :5670 Max. :878.0 Max. :1598
##
## X2B X3B HR RBI
## Min. :231.0 Min. : 6.0 Min. :122.0 Min. :575.0
## 1st Qu.:257.5 1st Qu.:21.0 1st Qu.:165.0 1st Qu.:647.5
## Median :276.5 Median :29.0 Median :192.0 Median :687.5
## Mean :275.2 Mean :29.1 Mean :187.0 Mean :691.5
## 3rd Qu.:288.0 3rd Qu.:33.0 3rd Qu.:210.2 3rd Qu.:731.8
## Max. :343.0 Max. :56.0 Max. :253.0 Max. :836.0
##
## StolenB CaughtS BB SO
## Min. : 19.00 Min. :13.00 Min. :382.0 Min. : 991
## 1st Qu.: 58.50 1st Qu.:26.50 1st Qu.:452.8 1st Qu.:1228
## Median : 76.00 Median :34.00 Median :498.0 Median :1302
## Mean : 84.57 Mean :33.37 Mean :502.9 Mean :1299
## 3rd Qu.:108.00 3rd Qu.:38.50 3rd Qu.:534.8 3rd Qu.:1356
## Max. :181.00 Max. :56.00 Max. :656.0 Max. :1543
##
## BAvg OBP SLG OPS
## Min. :0.2350 Min. :0.2990 Min. :0.3840 Min. :0.6850
## 1st Qu.:0.2482 1st Qu.:0.3160 1st Qu.:0.4027 1st Qu.:0.7245
## Median :0.2560 Median :0.3215 Median :0.4170 Median :0.7335
## Mean :0.2553 Mean :0.3214 Mean :0.4174 Mean :0.7387
## 3rd Qu.:0.2607 3rd Qu.:0.3282 3rd Qu.:0.4300 3rd Qu.:0.7558
## Max. :0.2820 Max. :0.3480 Max. :0.4610 Max. :0.8100
##
## TB GDP HBP SH SF
## Min. :2090 Min. : 88.0 Min. :33.00 Min. : 8.00 Min. :28.00
## 1st Qu.:2212 1st Qu.:114.8 1st Qu.:44.25 1st Qu.:24.50 1st Qu.:36.00
## Median :2292 Median :122.5 Median :53.00 Median :35.50 Median :39.50
## Mean :2304 Mean :124.0 Mean :55.03 Mean :34.17 Mean :40.47
## 3rd Qu.:2387 3rd Qu.:136.5 3rd Qu.:61.25 3rd Qu.:42.75 3rd Qu.:43.75
## Max. :2615 Max. :153.0 Max. :96.00 Max. :64.00 Max. :63.00
##
## IBB LOB
## Min. :16.00 Min. : 965
## 1st Qu.:21.50 1st Qu.:1061
## Median :31.00 Median :1102
## Mean :31.07 Mean :1097
## 3rd Qu.:38.50 3rd Qu.:1120
## Max. :60.00 Max. :1217
##
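Scaling is reasonable because the variables are measured on very different scales, with values ranging from 0.235 (BAvg) up to 5670 (AB). Without scaling, the distances used by the MDS would be dominated by the variables with the largest ranges, and the small-valued variables such as BAvg would have almost no influence on the result.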
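The effect can be checked on a small scale with two columns from the table above. This is only a sketch: it uses AB and BAvg for the first three teams, with the team names spelled as they appear in the data, and the object names are assumptions.

```r
# AB and BAvg for the first three teams in the table above.
stats <- data.frame(
  AB   = c(5665, 5514, 5524),
  BAvg = c(0.261, 0.255, 0.256),
  row.names = c("Aizona Diamondbacks", "Atlanta Braves", "Baltimore Orioles")
)

# Unscaled: the Euclidean distance is essentially the difference in AB alone.
d_raw <- as.matrix(dist(stats))
d_raw["Aizona Diamondbacks", "Atlanta Braves"]   # about 151, i.e. 5665 - 5514

# Scaled: each column has mean 0 and standard deviation 1,
# so BAvg contributes to the distance as well.
d_scaled <- as.matrix(dist(scale(stats)))
d_scaled["Aizona Diamondbacks", "Atlanta Braves"]
```

In the unscaled case the BAvg difference of 0.006 is invisible next to the AB difference of 151, which is the situation scale() is meant to avoid.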
2. Write an R code that performs a non-metric MDS with Minkowski distance=2 of the data (numerical columns) into two dimensions. Visualize the resulting observations in Plotly as a scatter plot in which observations are colored by League. Does it seem to exist a difference between the leagues according to the plot? Which of the MDS components seem to provide the best differentiation between the Leagues? Which baseball teams seem to be outliers?
#Plotting the first graph
# Note: columns 3:27 cover Won to IBB; LOB (column 28) is left out of the MDS.
baseball.numeric= scale(baseball[,3:27])
d=dist(baseball.numeric)
res=isoMDS(d,k = 2, p=2)
## initial value 19.778879
## iter 5 value 16.074932
## iter 10 value 15.763031
## final value 15.692462
## converged
coords=res$points
coordsMDS=as.data.frame(coords)
coordsMDS$Team=rownames(coordsMDS)
coordsMDS$League=baseball$League
b1 <- plot_ly(coordsMDS, x=~V1, y=~V2, type="scatter",mode= "markers" ,color= ~League, hovertext = ~Team, colors = "Set1")
The Scatter plot looks as follows:
The positions of the teams appear to be almost evenly mixed between the leagues. Even so, the V2 variable separates the leagues by a faint boundary at around V2 = -1.5. Based on this boundary, some NL teams can be regarded as outliers, namely the St. Louis Cardinals, NY Mets, Los Angeles Dodgers, San Diego Padres, Philadelphia Phillies, Milwaukee Brewers and Chicago Cubs, which are located beyond the boundary.
3. Use Plotly to create a Shepard plot for the MDS performed and comment about how successful the MDS was. Which observation pairs were hard for the MDS to map successfully?
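#Plotting the second graph
# Shepard() returns the original dissimilarities (x), the configuration
# distances (y) and the fitted monotone values (yf)
sh <- Shepard(d, coords)
delta <- as.numeric(d)          # original dissimilarities
D <- as.numeric(dist(coords))   # distances in the 2-D MDS configuration
n <- nrow(coords)
# Row and column index of every pair, in the same order as the dist object
index <- matrix(1:n, nrow = n, ncol = n)
index1 <- as.numeric(index[lower.tri(index)])
index <- matrix(1:n, nrow = n, ncol = n, byrow = TRUE)
index2 <- as.numeric(index[lower.tri(index)])
b2 <- plot_ly() %>%
  add_markers(x = ~delta, y = ~D, hoverinfo = 'text',
              text = ~paste('Obj1: ', rownames(baseball)[index1],
                            '<br> Obj 2: ', rownames(baseball)[index2])) %>%
  add_lines(x = ~sh$x, y = ~sh$yf)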
The Shepard plot is shown below:
The MDS does not seem very successful, because a strong linear relationship between the original dissimilarities and the distances in the MDS configuration was not observed: the fitted line in the Shepard plot does not closely follow a straight diagonal, even though the points cluster fairly tightly around it. The final stress of about 15.7 percent reported by isoMDS above also suggests a fit that is only moderate. The hardest pair to map was the Minnesota Twins and the Arizona Diamondbacks (spelled “Aizona Diamondbacks” in the data), which appears as an outlier in the plot despite not having the highest dissimilarity.
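The visual impression from a Shepard plot can be backed up with numbers. The sketch below uses the built-in swiss data as a stand-in (an assumption, since the baseball file is not bundled with R) and recomputes Kruskal's stress-1 from the Shepard() output, so that it can be compared with the stress value that isoMDS() reports in percent. Non-metric MDS only asks for a monotone relationship, so the check is on monotonicity of the fitted values rather than on linearity.

```r
library(MASS)

# Built-in 'swiss' data as a stand-in (an assumption; not the baseball data).
d_swiss <- dist(scale(swiss))
fit <- isoMDS(d_swiss, k = 2, trace = FALSE)
shep <- Shepard(d_swiss, fit$points)

# Non-metric MDS only requires the fitted values to be non-decreasing
# in the original dissimilarities.
is_monotone <- all(diff(shep$yf) >= -1e-8)

# Kruskal's stress-1 (in percent), recomputed from the Shepard output:
# shep$y are the configuration distances, shep$yf the fitted monotone values.
stress_pct <- 100 * sqrt(sum((shep$y - shep$yf)^2) / sum(shep$y^2))

is_monotone
stress_pct
fit$stress   # stress reported by isoMDS, for comparison
```

Pairs with a large gap between shep$y and shep$yf are the ones the configuration reproduces worst, which is the numerical counterpart of looking for points far from the fitted line.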
4. Produce series of scatterplots in which you plot the MDS variable that was the best in the differentiation between the leagues in step 2 against all other numerical variables of the data. Pick up two scatterplots that seem to show the strongest (positive or negative)
# Plotting the third graph
baseball$SuperVar <- coordsMDS$V2
b3 <- baseball %>%
gather(-Team, -League, -SuperVar, key = "var", value = "value") %>%
ggplot(aes(x = value, y = SuperVar)) +
geom_point(color ="Blue") +
facet_wrap(~ var, ncol = 7, scales = "free")
The grid plot looks as below:
The two strongest scatterplots are as follows:
SuperVar against SF: On this graph, most of the data points are clustered from left to right in a negative slope, in between values ranging from 30 to 50, with a few outliers located around the 60 mark.
SuperVar against X3B: Here the data points are more clustered towards the center of the graph, in between values of 18 and 40. Some extreme outliers are seen below 10 and around 50.
These two graphs were chosen because they demonstrate a close relationship between the two variables being plotted, making it easier to make predictions and draw meaningful conclusions about the data.
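A visual choice like this can be cross-checked by ranking the variables by their correlation with the MDS component. The helper below is hypothetical (rank_by_correlation is not part of any package) and is shown on synthetic stand-in data rather than the baseball data. It uses Pearson correlation, which is only one possible measure of strength.

```r
# Hypothetical helper: rank numeric columns by absolute Pearson correlation
# with a target column (here a stand-in for SuperVar).
rank_by_correlation <- function(data, target) {
  numeric_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  predictors <- setdiff(numeric_cols, target)
  r <- vapply(predictors, function(v) cor(data[[v]], data[[target]]), numeric(1))
  r[order(abs(r), decreasing = TRUE)]
}

set.seed(42)
# Synthetic stand-in data, not the baseball data set.
toy <- data.frame(SuperVar = rnorm(30))
toy$strong_neg <- -2 * toy$SuperVar + rnorm(30, sd = 0.1)
toy$weak_pos   <- toy$SuperVar + rnorm(30, sd = 5)
toy$label      <- rep(c("AL", "NL"), 15)   # non-numeric columns are ignored

rank_by_correlation(toy, "SuperVar")
```

Applied to the real data, the first two entries of the returned vector would be natural candidates for the two strongest scatterplots, keeping in mind that outliers can inflate or deflate a correlation coefficient.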
For the first assignment, the coding was done by Greeshma Jeev and the analysis was done by Olayemi. We both went through the outputs and the analysis and made our own suggestions on the results in order to make this report a success.
As for the second assignment, since we are both new to MDS and its application in R, we sat together and learned about various aspects of MDS and its coding in R by going through the lecture slides, textbooks and web resources. Several ambiguities arose while working on it. Most were cleared up by discussing them with classmates, and those that remained after these discussions were resolved by Mr Oleg. After getting a clearer understanding of the assignment, the coding was done by Greeshma Jeev. Since templates were already available for most of the code in this assignment, we both had more time to discuss and defend our analysis and findings.
The RMD file was designed together and coded by Greeshma Jeev. Content writing was done by both Olayemi and Greeshma Jeev.
APPENDIX
library(ggplot2)
library(gridExtra)
library(dplyr)
library(ggpubr)
library(plotly)
library(grid)
# Read the file
df <- read.csv("olive.csv")
sumdf <- summary(df)
#plotting the first graph
p<-ggplot(df,aes(x=palmitic, y=oleic, color=linoleic)) +
geom_point()+
ggtitle("Dependency of Palmitic over Oleic based on Linoleic")
intervaldf <- data.frame(cut_interval(df$linoleic,n = 4))
colnames(intervaldf) <- "linoleicinterval4"
p1<-ggplot(df,aes(x=palmitic, y=oleic, color=intervaldf$linoleicinterval4)) +
geom_point()+
ggtitle("Dependency of Palmitic over Oleic based on Linoleic Level")+
labs(color = "Linoleic Interval")
#plotting the second graph
t1 <- textGrob("Comparison of different graphs on \n Dependency of Palmitic over Oleic based on Linoleic Level")
p1<-ggplot(df,aes(x=palmitic, y=oleic, color=intervaldf$linoleicinterval4)) +
geom_point()+ labs(color = "Linoleic Interval")
p2<-ggplot(df,aes(x=palmitic, y=oleic, size=intervaldf$linoleicinterval4)) +
geom_point()+ labs(size = "Linoleic Interval")
p3<-ggplot(df,aes(x=palmitic, y=oleic)) +
geom_point()+ geom_spoke(aes(angle = as.integer(intervaldf$linoleicinterval4)), radius = 50)
graphs <- arrangeGrob(grobs = list(t1,p1,p2,p3), ncol = 1,nrow = 4,heights = c(4,10,10,10))
palmiticXoleicXlinoleic <- as_ggplot(graphs)
palmiticXoleicXlinoleic
#plotting the third graph
t2 <- textGrob("Comparison of different graphs on \n Dependency of Oleic over Eicosenoic based on Region")
p4<-ggplot(df,aes(x=oleic, y= eicosenoic, color=Region)) +
geom_point()
p5<-ggplot(df,aes(x=oleic, y= eicosenoic, color=as.factor(Region))) +
geom_point()
graphs <- arrangeGrob(grobs = list(t2,p4,p5), ncol = 1,nrow = 3,heights = c(4,10,10))
oleicXeicosenoicXRegion <- as_ggplot(graphs)
oleicXeicosenoicXRegion
#plotting the fourth graph
intervaldf$linoleicinterval3 <- cut_interval(df$linoleic,n = 3)
intervaldf$palmiticinterval3 <- cut_interval(df$palmitic,n = 3)
intervaldf$palmitoleicinterval3 <- cut_interval(df$palmitoleic,n = 3)
p6<-ggplot(df,aes(x=oleic, y= eicosenoic, color=intervaldf$linoleicinterval3,
shape = intervaldf$palmiticinterval3, size = intervaldf$palmitoleicinterval3)) +
geom_point()+
ggtitle("Dependency of Oleic over Eicosenoic based on Linoleic Level, Palmitic and Palmitoleic")+
labs(color = "Linoleic Interval", shape = "Palmitic Interval", size = "Palmitoleic Interval")
#plotting the fifth graph
p7<-ggplot(df,aes(x=oleic, y= eicosenoic, color=Region,
shape = intervaldf$palmiticinterval3, size = intervaldf$palmitoleicinterval3)) +
geom_point()+
ggtitle("Dependency of Oleic over Eicosenoic based on Region, Palmitic and Palmitoleic")+
labs(color = "Region", shape = "Palmitic Interval", size = "Palmitoleic Interval")
#plotting the sixth graph
p8 <- plot_ly(data=df,labels=~factor(Area), type = "pie", hoverinfo ='label', textinfo = "none")%>%
layout(title= "Areal Distribution of Oil Production", showlegend = F)
#plotting the seventh graph
t3 <- textGrob("Comparison of Density contour and Relational Scatter plot between Linoleic and Eicosenoic")
p9<-ggplot(df,aes(x=linoleic, y= eicosenoic)) +
geom_density_2d()
p10<-ggplot(df,aes(x=linoleic, y= eicosenoic)) +
geom_point()
graphs1 <- arrangeGrob(grobs = list(t3,p9,p10), ncol = 1,nrow = 3, heights = c(1,10,10))
linoleicXeicosenoic <- as_ggplot(graphs1)
library(plotly)
library(MASS)
library(xlsx)
library(tidyr)
#Reading the baseball data
baseball = read.xlsx(file = "baseball-2016.xlsx",sheetIndex = 1)
row.names(baseball) <- baseball$Team
summary(baseball)
#Plotting the first graph
# Note: columns 3:27 cover Won to IBB; LOB (column 28) is left out of the MDS.
baseball.numeric= scale(baseball[,3:27])
d=dist(baseball.numeric)
res=isoMDS(d,k = 2, p=2)
coords=res$points
coordsMDS=as.data.frame(coords)
coordsMDS$Team=rownames(coordsMDS)
coordsMDS$League=baseball$League
b1 <- plot_ly(coordsMDS, x=~V1, y=~V2, type="scatter",mode= "markers" ,color= ~League, hovertext = ~Team, colors = "Set1")
#Plotting the second graph
sh <- Shepard(d, coords)
delta <-as.numeric(d)
D<- as.numeric(dist(coords))
n=nrow(coords)
index=matrix(1:n, nrow=n, ncol=n)
index1=as.numeric(index[lower.tri(index)])
n=nrow(coords)
index=matrix(1:n, nrow=n, ncol=n, byrow = T)
index2=as.numeric(index[lower.tri(index)])
b2 <- plot_ly()%>%
add_markers(x=~delta, y=~D, hoverinfo = 'text',
text = ~paste('Obj1: ', rownames(baseball)[index1],
'<br> Obj 2: ', rownames(baseball)[index2]))%>%
add_lines(x=~sh$x, y=~sh$yf)
# Plotting the third graph
baseball$SuperVar <- coordsMDS$V2
b3 <- baseball %>%
gather(-Team, -League, -SuperVar, key = "var", value = "value") %>%
ggplot(aes(x = value, y = SuperVar)) +
geom_point(color ="Blue") +
facet_wrap(~ var, ncol = 7, scales = "free")